-
Notifications
You must be signed in to change notification settings - Fork 53
Enable custom training job in SageMaker plugin #113
Conversation
…anism; some other smaller refactoring
Codecov Report
@@ Coverage Diff @@
## master #113 +/- ##
==========================================
+ Coverage 58.75% 61.87% +3.12%
==========================================
Files 99 104 +5
Lines 5957 6434 +477
==========================================
+ Hits 3500 3981 +481
+ Misses 2106 2084 -22
- Partials 351 369 +18
Continue to review full report at Codecov.
|
… can access the DataStore and such
…and built-in training plugin
@bnsblue can we update the description? |
return trainingJob, nil | ||
} | ||
|
||
func (m awsSagemakerPlugin) getTaskPhaseForTrainingJob( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this different for every job type? or are the jobSTatus common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output handling is different for different type of tasks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the job statuses are not common either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya I realized once I went through all of them
OutputDataConfig: &commonv1.OutputDataConfig{ | ||
S3OutputPath: ToStringPtr(outputPath), | ||
}, | ||
CheckpointConfig: nil, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is one thing we might want to model in the future. Passing in a custom checkpoint config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes we will do this soon
func (m awsSagemakerPlugin) getEventInfoForJob(ctx context.Context, job k8s.Resource) (*pluginsCore.TaskInfo, error) { | ||
|
||
var jobRegion, jobName, jobTypeInURL, sagemakerLinkName string | ||
if m.TaskType == trainingJobTaskType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move these parts as submethods in the specific files? co-located code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get what you mean. Mind to elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when task type == custom that code block could be in the custom.go file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done refactoring
@bnsblue This is awesome work! Thank you. Couple comments from my side |
Would you mind elaborating some of your comments? |
Will do |
@kumare3 please take another look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me!
* checking key existence in GetRole() * (training job plugin only) inject hp; refactor image uri getting mechanism; some other smaller refactoring * change the key and value of the flyte sm command * get rid of a obsolete function * add and change unit tests * add unit tests * lint * refactor a dummy object generation function * refactor a dummy object generation function * add unit tests * lint * forming the runner cmd and inject it as a hyperparameter in SageMaker plugin * fix unit tests * fix unit tests * add custom training plugin * revert overriden changes * compose the __FLYTE_SAGEMAKER_CMD__ in custom training job plugin * add unit test * lint * fix getEventInfoForJob * fix image getting * fix unit tests * stick with inputs.pb and modify hp injecting logic accordingly * fix args converting logic * use default file-based output for custom training job * expanding PluginContext interface with necessary methods so SM plugin can access the DataStore and such * lint error * add unit tests * add logic to inject env vars into hyperparameters * fix output prefix * fix output prefix * remove job name from output prefix for now * fix a unit test * accommodating new arg and env var passsing syntax * injecting a env var to force disable statsd for sagemaker custom training * correcting variable name * remove unused constant * remove comments * fix unit tests * merge template.go * pr comments * add guarding statement wrt algorithm name for custom training plugin and built-in training plugin * refactor file structures: splitting the code into multiple files to optimize for readability * add documentations to a set of constants, and fix a constant's name * split tests into multiple files * correcting error types: make permanent failures * refactor
TL;DR
This PR adds a flyteplugin to properly generate the CRD for SageMaker custom training job and monitor the status of the submitted job.
Type
Are all requirements met?
Complete description
See flyteorg/flyte#451
Tracking Issue
https://github.com/lyft/flyte/issues/
Follow-up issue
NA
OR
https://github.com/lyft/flyte/issues/